arXiv: math. PR/0000512 



On the transductive arguments in
statistics

Ya'acov Ritov*, 

Department of Statistics, The Hebrew University, 91905 Jerusalem, Israel e-mail:
yaacov.ritov@gmail.com; url: http://pluto.mscc.huji.ac.il/~yaacov

Abstract: The paper argues that a part of the current statistical discussion
is not based on the standard firm foundations of the field. Among
the examples we consider are prediction into the future, semi-supervised
classification, and causality inference based on observational data.

Keywords and phrases: Foundations, Time series, Causality, Counter- 
factual, Semi-supervised learning. 

1. Introduction

Let $Y_1, \dots, Y_T$ be some time series. At time $T$ we want to predict the value of
$Y_{T+1}$. This looks like a standard statistical problem, feasible under an
assumption of enough stationarity in the sequence. For example, it may be assumed
that $\varepsilon_t = Y_t - \beta_1 Y_{t-1} - \beta_2 Y_{t-2}$, $t = 3, 4, \dots$, are i.i.d. The value $Y_{T+1}$ is then
going to be predicted by $\hat\beta_{T1} Y_T + \hat\beta_{T2} Y_{T-1}$, where $\hat\beta_{T1}, \hat\beta_{T2}$ are estimated based on
the sequence $Y_1, \dots, Y_T$. Does this practical thinking have good statistical
foundations?
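The practice described above can be sketched in a few lines. This is only an illustration under stated assumptions: the series is simulated, the parameter values and the helper names `fit_ar2` and `predict_next` are hypothetical, and plain least squares stands in for whatever estimator a practitioner would actually use.

```python
import numpy as np

def fit_ar2(y):
    """Least-squares estimates of (beta1, beta2) in
    Y_t = beta1 * Y_{t-1} + beta2 * Y_{t-2} + eps_t."""
    y = np.asarray(y, dtype=float)
    # Row for time t holds (Y_{t-1}, Y_{t-2}), t = 3, ..., T.
    design = np.column_stack([y[1:-1], y[:-2]])
    target = y[2:]
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta

def predict_next(y, beta):
    """One-step-ahead prediction beta1 * Y_T + beta2 * Y_{T-1}."""
    return beta[0] * y[-1] + beta[1] * y[-2]

# Simulated stationary AR(2) series (illustrative parameters, not from the paper).
rng = np.random.default_rng(0)
true_beta = np.array([0.5, -0.3])
y = np.zeros(2000)
for t in range(2, len(y)):
    y[t] = true_beta[0] * y[t - 1] + true_beta[1] * y[t - 2] + rng.normal()

beta_hat = fit_ar2(y)
y_next = predict_next(y, beta_hat)
```

The fit is good here only because the simulation keeps generating data from the same model after time $T$, which is exactly the assumption the question above puts in doubt.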

Another example. Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be an i.i.d. sample, where $Y_i \in \{0, 1\}$
is a label attached to observation $i$. Suppose we also have a large sample
of unlabeled data $X_{n+1}, \dots, X_N$, where $N \gg n$. Can we use the unlabeled data
when we want to find a good classification rule? Can we justify this algorithm?

Finally, let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a simple random sample, and $(X, Y)$
another copy, where, for simplicity, $X_i \in \{0, 1\}$. We want to test whether $X$ is
the cause of $Y$. That is, if we force $X$ to be 0, will the distribution of $Y$
be different than if $X$ is manipulated to be 1? Can this test be devised?

These three examples are typical of what we consider transductive
inference, and we believe that most likely this type of inference goes beyond the
legitimate boundaries of standard statistical theory. In these examples, the
statistician is extrapolating outside the observed model, making predictions based on
ungrounded belief.

A remark. The term statistics, as used in this paper, and its different
derivatives like statistician, are not restricted to the research done by members of
departments of statistics who were also students in such a department. We
mean by this anything related to inference done on the basis of empirically
collected data, by scientists who deal with anything from machine
learning to neuroscience.

In the next section we lay out some of the standard foundations of
statistical practice. In Section 3 we explain what we mean by transduction. In
the following three sections we describe in detail the three examples mentioned
above. Some issues in time series are discussed in Section 4. Inference with
partially labeled data is described in Section 5, and Section 6 deals with the
causality argument. Some concluding remarks are given in Section 7.

2. Background: the theoretical foundation of statistical inference

Statistical inference is based on an experiment. The basic notion of statistics is
based on the collection of data, the understanding that at least in some sense
the data could be different, and then, finally, making a statement which goes
beyond the description of the actual observed data. Formally, the (statistical)
experiment $E$ is built out of a few elements. There is a set $\Omega$, endowed with a sigma
field $\mathcal{F}$, and a random element $Z$ which is $\mathcal{F}$-measurable. There is a parameter
set $\Theta$ and a family of probability measures $\{P_\vartheta : \vartheta \in \Theta\}$, such that for every
$\vartheta \in \Theta$, $(\Omega, \mathcal{F}, P_\vartheta)$ is a probability space. We observe $Z$ and assume that it follows
$P_\vartheta$ for some $\vartheta \in \Theta$. See, for example, Berger and Wolpert (1988).

The parameter $\vartheta$ is unknown, and is not directly observed. Some may call it
the state of nature (Berger, 1993). In some very restrictive sense they are right. It
certainly is some parameter of reality. Whatever it is, we want to infer something
about it from the observable $Z$. The statistician is reporting the evidence about
$\vartheta$ arising from the experiment $E$ and $Z$ (Birnbaum, 1962; Berger and Wolpert,
1988).

Different authors differ on the scope of the statistical conclusion. Le Cam (1986)
believes that "each element of the set $\Theta$ represents a particular 'theory' about
the physical phenomena involved in the experiment." In contrast, Bickel and
Doksum (2000) write "Our aim is to use the data inductively, to narrow down
in useful ways our ideas on what the 'true' P is." The difference between these
two points of view may not seem apparent. We believe they answer differently
the question of how far from the data the statistician can go. Le Cam speaks
about a theory underlying the data, and hence the conclusion from the
experiment may go very far from the data, as far as the theory reaches, while Bickel
and Doksum believe the statistician is limited to generalizations from sample
to population. Thus, Cox (1958) argues that "a statistical inference carries us
from observations to conclusions about the populations sampled." He contrasts
this with "a scientific inference in the broader sense [which] is usually concerned
with arguing from descriptive facts about populations to some deeper
understanding of the system under investigation." However, the leap from data to the
deep understanding of Le Cam's theory about Berger's state of nature is not
statistical, or empirical, and most likely needs a leap of faith.

Thus, Tukey (1960) points to the "difference between 'statistical conclusions'
and 'experimenter's conclusions'." Statisticians aim at precise statements, hence
Tukey continues and claims that "Both the statistician morale and integrity are
tested . . . when he has to face the possibility of a really substantial systematic
error just after he has used all his skill to reduce, . . . the effects of fluctuating
errors to 95% of their former value." The difficulty is that when we go beyond the
population, the safety nets we create and are proud of, like precise confidence intervals
and P-values, are in doubt.

One of the basic concepts of the field is that any precise statement on $\vartheta$ is
impossible. Statistical inference is done with error. In other words, a particular
inference is rarely exactly valid. Still, the field is proud of being able to make precise
statements about the error. Whether this is done by presenting the object of
inference as a random variable with a Bayesian a-posteriori distribution, or with
a frequentist confidence interval, the result is a precise quantification of the
inference error. However, if the conclusion is derived with unknown 'systematic
error', then one may doubt the importance of the exact quantification of the
'statistical error'.

Statisticians are well aware of the danger of extrapolation. For a very cute
example consider the prediction of a newborn's ear length. Altman and Bland
(1998) used the regression line presented in Heathcote (1995):
Ear_Length = 55.9 + 0.22 × Age, where ear length is measured in millimeters and age in years.
This equation, based on a sample, predicts an ear length of 55.9 mm at age 0,
which is absurd. The explanation is simple: the regression line was found by
measuring the ear length of a sample of men 30 to 93 years old. A minimum
demand of a proper statistical inference is that it be supported by the
data. Extrapolation is going beyond the data, and hence it is considered
problematic. In the next section we consider a much further reaching extrapolation,
in which the statistician is going not only beyond the data, but also beyond the
model.
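The ear-length example is easy to reproduce. The sketch below uses only the fitted line and the sampled age range quoted above; the function names and the range check are illustrative additions, not part of the cited analysis.

```python
def ear_length_mm(age_years):
    """Fitted line quoted in the text: 55.9 + 0.22 * age."""
    return 55.9 + 0.22 * age_years

SAMPLED_AGES = (30, 93)  # age range of the men in the sample, as stated in the text

def predict_with_range_check(age_years):
    """Return (prediction, supported); supported is True only inside the sampled range."""
    lo, hi = SAMPLED_AGES
    return ear_length_mm(age_years), lo <= age_years <= hi

inside = predict_with_range_check(60)   # interpolation within the data
newborn = predict_with_range_check(0)   # extrapolation far outside the data
```

The arithmetic is equally happy at age 0 and at age 60; only the range check, which comes from knowing how the data were collected, tells the two cases apart.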

3. From induction and deduction to transduction 

We argued that the legitimate statistical argument is from sample to
population. For example, predicting the value of a random variable taken from the
same distribution as the observed i.i.d. sample. We refer to this type of
statistical inference as induction. A statistical deduction is the inference about
the parameter describing the population from which the sample was taken. The
difference between these two may be considered verbal, pointing to two
different perspectives on the same object. Consider for example a Gaussian mixture
model: $X = (X_1, \dots, X_n) \in \mathbb{R}^n$, with $X_1, \dots, X_n$ i.i.d. from $F_\vartheta$, and $F_\vartheta = \sum_{i=1}^{k} \alpha_i N(\mu_i, 1)$, where
$\vartheta = (\alpha, \mu)$: $\alpha$ is a point in the $k$-dimensional simplex and $\mu \in \mathbb{R}^k$. The
estimation of $F_\vartheta(x)$ is what we call an induction, while the estimation of $\alpha$ is our
deduction.
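The two perspectives can be put side by side in a small simulation. This is a sketch under simplifying assumptions that are not in the text: $k = 2$, the means are treated as known, and the weights are estimated by a plain EM iteration (the helper `em_weights` is hypothetical).

```python
import math
import numpy as np

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def em_weights(x, means, n_iter=200):
    """EM for the mixing weights alpha of sum_i alpha_i N(mu_i, 1), means held fixed."""
    k = len(means)
    alpha = np.full(k, 1.0 / k)
    # Unit-variance normal densities up to a common constant, one column per component.
    dens = np.exp(-0.5 * (x[:, None] - np.asarray(means)[None, :]) ** 2)
    for _ in range(n_iter):
        resp = alpha * dens
        resp /= resp.sum(axis=1, keepdims=True)   # E-step: posterior memberships
        alpha = resp.mean(axis=0)                 # M-step: update the weights
    return alpha

rng = np.random.default_rng(1)
alpha_true = np.array([0.3, 0.7])
means = np.array([-2.0, 2.0])
n = 5000
component = rng.choice(2, size=n, p=alpha_true)
x = rng.normal(loc=means[component], scale=1.0)

# Induction: estimate F(x0) directly by the empirical distribution function.
x0 = 0.0
F_hat = np.mean(x <= x0)
F_true = sum(a * normal_cdf(x0 - m) for a, m in zip(alpha_true, means))

# Deduction: estimate the parameter alpha itself.
alpha_hat = em_weights(x, means)
```

Both estimates are statements about the population that produced the sample, which is what distinguishes them from the transduction discussed next.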

The problem of the justification of statistical induction and deduction is
on the one hand too simple to dwell on, and on the other hand too deep, going
beyond the scope of this note.

Our concern is another type of statistical inference, which we call transduction.
When transduction is used, the statistician goes well beyond the data. This
type of inference was relatively rare when the standard model was simple. If the
statistician believes in the i.i.d. normal shift model, there is very little that one
can infer about except for the value of the mean (and maybe the variance) of
the population distribution. Once statisticians started to work on more complex
data structures, the possibilities and the dangers became abundant.

In fact, what is done here is extrapolating beyond the observed model, and 
not only beyond the range of the values observed in the sample. If the latter is 
dangerous, the former is much more so. 

More formally, let $\nu : \Theta \to \mathbb{R}$ be a parameter of interest. Let $I_\nu(X; E)$ be any
decision about $\nu(\vartheta)$ made in the context of the experiment $E$. Statisticians are
trained to report the distribution of $I_\nu(X; E)$. Thus, the 95% confidence interval
is defined as a (random) set such that, if the experiment is repeated again and
again, $X^i \sim F_\vartheta$, $i = 1, \dots, N$, and we decide $I_\nu(X^1; E), \dots, I_\nu(X^N; E)$,
then the cardinality of the set $\{i : \nu(\vartheta) \in I_\nu(X^i; E)\}$ is approximately $0.95N$.
Taking care of the danger of extrapolation is slightly more fuzzy. It means
basically restricting the set of 'legitimate' questions $\nu$ the statistician may ask,
and at least avoiding functions $\nu$ such that $\nu(\vartheta)$ is not a smooth function of $F_\vartheta$.

The gedankenexperiment described above is what enables us to run away from
the particular inference, about which we usually can say very little, to the general
scheme, which can be exact to a known degree.
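The repeated-experiment reading of a 95% confidence interval can be checked numerically. The sketch assumes a normal sample with known variance and uses illustrative values for $\vartheta$, the sample size and the number of repetitions; none of these come from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma, n, n_rep = 1.5, 1.0, 50, 4000   # illustrative values
half_width = 1.96 * sigma / np.sqrt(n)        # known-variance normal interval

# Repeat the same experiment n_rep times and form the interval each time.
samples = rng.normal(theta, sigma, size=(n_rep, n))
centers = samples.mean(axis=1)
covered = np.abs(centers - theta) <= half_width
coverage = covered.mean()   # fraction of repetitions whose interval contains theta
```

The coverage statement is about the scheme, not about any single interval, and it holds only because every repetition is a draw from the same experiment $E$.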

However, in much of modern statistical theory, the gedankenexperiment
considered is different. In fact, the argument is based on conceiving a list of
experiments $E^1, E^2, \dots$. The claim is: this type of logic worked in the past, why
would it not work in this particular setup? Assuming that $P(Y \mid X)$ is smooth
relative to the distribution of $X$ worked in these best typical (hardly related)
models, why wouldn't it work for this new technique, for this very different
model in a completely different field? We were able to prove that smoking is
bad for your health, why wouldn't the proof that hormone replacement therapy
is a wonderful cure for the pitfalls of middle age be valid?

In the following sections we discuss at some length the examples given
in the introduction.

4. Time series prediction 

Analyzing the role of the economists in the recent economic crisis, Paul Krugman
wrote of "the profession's blindness to the very possibility of catastrophic
failures in a market economy. During the golden years, financial economists came
to believe that markets were inherently stable, indeed, that stocks and other
assets were always priced just right. There was nothing in the prevailing
models suggesting the possibility of the kind of collapse that happened last year"
(Krugman, 2009). As he saw it, "the economics profession went astray because
economists, as a group, mistook beauty, clad in impressive-looking mathematics,
for truth." We want to put this in a more general context (however, we do
equate mathematical beauty with truth).

4.1. Two problems of predictions

We should distinguish between two very different problems. 

Prediction problem 1 (PP1): Suppose $Y_1, \dots, Y_T, Y_{T+1} \in (-\infty, \infty)$. We
observe $Y_1, \dots, Y_T$ and want to predict $Y_{T+1}$. A good predictor $\hat Y_{T+1}$ is such that

$$L(T) = (\hat Y_{T+1} - Y_{T+1})^2$$

is small.

Prediction problem 2 (PP2): Suppose that $Y_1, Y_2, \dots \in [0, 1]$, and
we want to predict $Y_{t+1}$ using its past, $Y_1, \dots, Y_t$, for $t = 1, 2, \dots$. We say
that a predictor is good if

$$L(T) = \sum_{t=1}^{T} (\hat Y_{t+1} - Y_{t+1})^2$$

is small.

There are a few critical differences between these two problems. The first is 
that the first problem deals with a single event, while the second deals with a 
repeated one. As a result, the first problem asks for an unverified method, while 
the second problem asks for a method that can be checked and corrected as more 
data come in. The final difference between the two problems is that the second
has a built-in guarantee against catastrophe: the range of $Y_t$ is restricted to
the unit interval. We believe that the first problem is a less legitimate statistical
problem than the second.

The main assumption underlying PP1 is that the future is in some sense like
the past. Only in some sense, since typically we observe a dynamic system. This
assumption is not statistical. Of course, PP1 is well grounded if the prediction
is based on a well verified physical theory. However, this is rarely the case.
In the typical case, something like an ARMA model is going to be fitted to
the existing data. Theory, if it exists at all, is based on similar models used in the
past, where they seemingly worked nicely. In either case we are in the $E^1, E^2, \dots$
gedankenexperiment, which prevents any possibility of giving sense to confidence
intervals, or anything alike. The danger of the Black Swan phenomenon is there,
and it is real (Taleb, 2007).

PP2 is different. One is now interested only in the flock, and anyway, the
range of their color is restricted from white to bright gray. One needs very weak
assumptions in order to solve it. In fact a very strong result can be claimed.
We now go into this model in great detail. PP2 is directly related to
the problem of the self-calibrated deterministic forecaster, cf. Dawid (1985), Oakes
(1985), Foster (1999), Foster and Vohra (1998), and Fudenberg and Levine
(1999).

4.2. Regression. A solution to PP2

One can start with any presumed estimator, e.g., the auto-regression model,
fit the parameters, and then unbias the prediction, given the past prediction
experience. The result will be an estimator that predicts the next observations
without significant bias, and performs well along the sequence. This is so without
any assumption on the generator of $Y_1, Y_2, \dots$. In fact, the $Y$ sequence can be
generated by somebody who knows the algorithm used by the statistician and
tries to fool him. Here are some details.

Suppose first that we are observing the series $Z_1, Z_2, \dots$, and we want to
construct an unbiased predictor such that $E(Z_{t+1} - \hat Z_{t+1} \mid \hat Z_{t+1}) = 0$. To be
more precise, we consider a sort of zero-sum game in which the statistician
chooses functions $f_t : [0,1]^t \to [0,1]$, $t = 1, 2, \dots$. His opponent chooses a
sequence $Z_1, Z_2, \dots$ (not necessarily randomly). The statistician predicts $Z_{t+1}$ by
$\hat Z_{t+1} = f_t(Z_1, \dots, Z_t)$. Let $\varepsilon > 0$. The statistician wins if for any $z$, $|r_t(z) - z| > \varepsilon$
a finite number of times, where

$$r_t(z) = \frac{\sum_{s=1}^{t} Z_s 1(|\hat Z_s - z| < \varepsilon)}{\sum_{s=1}^{t} 1(|\hat Z_s - z| < \varepsilon)}.$$


The suggested predictor slowly walks on a grid according to a moving
average of the observations. Let $\mathcal{S} = \{\xi_0, \dots, \xi_K\}$, $\xi_j = 2j\eta$, $\eta = 1/2K$, be the
possible forecast values, and let $T$ be some large number, $T \gg K$. Informally
our procedure is as follows. If we switch to the decision $\xi_j$ at the time $t_i$, we
stick to this decision for a while. We give the history that led to $\xi_j$ a weight
of $T$, and accumulate new data. When the weighted mean deviates from $\xi_j$ by
more than $\eta$, we walk to $\xi_{j \pm 1}$.
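The informal procedure can be sketched as follows. This is one reading of the description above, not the author's implementation: the function name, the default values of $K$ and $T$, and the choice to reset the accumulated data at every move are assumptions.

```python
def grid_walk_forecasts(z, K=10, T=100, start_index=0):
    """Forecast each z[t] from the past only, moving on the grid {j/K : j = 0..K}."""
    eta = 1.0 / (2 * K)
    j = start_index           # current grid index; the forecast is j / K
    total, count = 0.0, 0     # observations accumulated since the last move
    forecasts = []
    for value in z:
        phi = j / K
        forecasts.append(phi)          # the forecast is issued before value is seen
        total += value
        count += 1
        # Weighted mean: the history that led to phi gets weight T.
        weighted_mean = (T * phi + total) / (T + count)
        if weighted_mean - phi > eta and j < K:
            j, total, count = j + 1, 0.0, 0     # walk one grid step up
        elif weighted_mean - phi < -eta and j > 0:
            j, total, count = j - 1, 0.0, 0     # walk one grid step down
    return forecasts

# A constant sequence: the forecast should climb the grid and settle at 0.8.
forecasts = grid_walk_forecasts([0.8] * 1000)
```

Because each forecast uses only earlier values, the same code can be run against a sequence chosen by an opponent who knows the rule.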

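Formally, let $\phi_0 \in \mathcal{S}$ and $t_0 = 1$. For $i \ge 0$ define

$$t_{i+1} = \inf\Bigl\{ t > t_i : \Bigl| \frac{T\phi_i + \sum_{s=t_i+1}^{t} Z_s}{T + t - t_i} - \phi_i \Bigr| > \eta \Bigr\},$$

where the infimum of an empty set is infinite, and

$$\phi_{i+1} = \begin{cases} \phi_i + 2\eta, & \dfrac{T\phi_i + \sum_{s=t_i+1}^{t_{i+1}} Z_s}{T + t_{i+1} - t_i} - \phi_i > \eta, \\[2ex] \phi_i - 2\eta, & \dfrac{T\phi_i + \sum_{s=t_i+1}^{t_{i+1}} Z_s}{T + t_{i+1} - t_i} - \phi_i < -\eta. \end{cases}$$

Note that $\phi_i \in \mathcal{S}$, $i = 1, 2, \dots$. The suggested forecast is $\hat Z_{t+1} = \phi_i$ for $t_i \le t < t_{i+1}$.
Let

$$A_j = \bigcup_i \{t_i, \dots, t_{i+1} - 1 : \phi_i = \xi_j \ \&\ \phi_{i+1} = \xi_{j-1}\},$$

$$B_j = \bigcup_i \{t_i, \dots, t_{i+1} - 1 : \phi_i = \xi_{j-1} \ \&\ \phi_{i+1} = \xi_j\}, \qquad j = 1, \dots, K,$$

and

$$R_{tj} = \frac{\sum_{s=1}^{t} Z_s 1(s \in A_j \cup B_j)}{\sum_{s=1}^{t} 1(s \in A_j \cup B_j)}, \qquad t = 1, 2, \dots, \quad j = 1, \dots, K.$$

Then we have:

Theorem 4.1 Under the above strategy, if all of $t_1, t_2, \dots$ are finite, then, for
any $j = 1, \dots, K$, either $\hat Z_{t+1} = \xi_j$ a finite number of times, or $|R_{tj} - \xi_j| >
2/K + K/T$ a finite number of times.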

We return now to the sequence $Y_1, Y_2, \dots$ of PP2. We start with $\tilde Y_{t+1}$,
a favorite predictor. For example, in rain prediction $\tilde Y_{t+1}$ can be based on any
time series method applied to the rain history, with or without the
information about the global weather map. The next stage is to approximately unbias
it. The canonical construction of the final predictor is based on a partition
$\{A_1, \dots, A_M\}$ of the possible values of $\tilde Y_{t+1}$. For $m = 1, \dots, M$ let $Z_{m1}, Z_{m2}, \dots$
be the subsequence $\{Y_t : \tilde Y_t \in A_m\}$. Apply the above scheme separately for each
of these subsequences to obtain $\hat Z_{mt}$. Then finally we use the predictor $\hat Y_{t+1} =
\sum_m \hat Z_{m \tau_m(t)} 1(\tilde Y_{t+1} \in A_m)$, where $\tau_m(t) = \#\{s : s \le t \text{ and } \tilde Y_{s+1} \in A_m\}$.

4.3. Conclusion: the difference between PP1 and PP2

The compound decision approach of PP2 saved the analysis, because it moved
us from a particular inference (about $Y_{T+1}$, for a very particular $T$), to the general
inference setup (about $Y_1, Y_2, \dots$). Also, we have restricted the possible values
of the $Y$'s to avoid catastrophe by missing a single event.

We do not argue that prediction in time series is impossible. It is possible 
when it is a general scheme done under control (i.e., like PP2). We believe that 
predicting the future, that is, predicting one most important future event, is not 
a statistical task. 

5. Semi-supervised classification

It is very frustrating to waste good data, even when they are hardly related to
the problem at hand. It is very tempting to use them, even if an unverified
transductive argument is used in justifying the exercise.

5.1. Can Classification be based on clustering? 

Consider a standard classification problem. A unit is characterized by a vector
of variables $X$ and a label $Y$. The statistical task is to find a classification
rule by which the label $Y$ can be predicted based on the value of $X$. It is
quite standard that the data are collected electronically but judged by humans.
It may then happen that most data points are not labeled. This is the
semi-supervised situation, and typical examples are satellite photography of terrains,
or electronic espionage, but more down to earth examples exist, e.g., results of
routine medical tests may be abundant, but relatively few patients have passed a
thorough examination.

Formally, let $(X_1, Y_1), \dots, (X_n, Y_n)$ be an i.i.d. sample, where $Y_i \in \{0, 1\}$
is a label attached to observation $i$. Suppose we also have a large sample of
unlabeled data $X_{n+1}, \dots, X_N$, where $N \gg n$. In fact, $N$ may be large enough
to be considered infinity for all practical purposes. See, for example, Joachims
(1999), Belkin and Niyogi (2004), Amini and Gallinari (2005), and references
given there.

Fig 1: (a) The classified data; (b) Unlabeled data was added; (c) The boundary
of the Bayes classifier was added

Semi-supervised classification is tempting. Consider Figure 1(a). The
problem is finding the best classifier. Geometrically, we want to find the boundary
between the area where most of the marks are forward slashes and the area where
most are back slashes. It is not an easy task. In Figure 1(b) we added unlabeled
points. Now, it seems not that difficult to find the boundary. Since the figures
were created by simulation, we can add the optimal boundary corresponding to
the Bayes classifier. It is given in Figure 1(c), lying exactly where most readers
thought it should be.

Here is an example where it is easy to see what is going on^1. Suppose
$u_1, u_2, \dots, u_K$ are i.i.d. exponential with mean $\sigma$, and $v_1, v_2, \dots, v_K$ are i.i.d.
exponential with mean $\tau$. Let $a_1 < b_1 < a_2 < b_2 < \dots < a_K < b_K$ be such that
$b_i - a_i = u_i / (\sum u_j + \sum v_j)$, and $a_{i+1} - b_i = v_i / (\sum u_j + \sum v_j)$. Now let the
covariates $X_1, X_2, \dots, X_N$ be i.i.d. uniform on $\bigcup_i (a_i, b_i)$, and the variables $p_1, p_2, \dots, p_K$
be i.i.d. with $P(p_i = 1) = P(p_i = 0) = 1/2$. Finally let the labels $Y_1, \dots, Y_n$ be
independent with $P(Y_i = 1 \mid X) = \sum_j p_j 1(X_i \in (a_j, b_j))$. Consider now the asymptotics as
$n \to \infty$, $n/N \to 0$, $n/K \to \gamma > 0$, $\sigma/\tau \to \infty$, and $N\tau / \log(N)K \to \infty$.

The length of the support is approximately 1, and there are $N$ observations uniformly
distributed in it. Hence the largest spacing between observations belonging
to the same interval is $O_p(\log(N)/N)$. The distance between adjacent intervals is
approximately exponential, and hence the minimal distance between two
adjacent intervals of the support is $O_p(\tau/K)$. The ratio between these two is
$O_p(\log(N)K/N\tau) = o_p(1)$. It follows, therefore, that we can know where the
spacing between any pair of adjacent unclassified observations is small and the two
members belong to the same interval, and where the spacing is large and they
belong to different intervals. Let $A_j$ and $B_j$ be the smallest and largest observed
$X$ inside the interval $(a_j, b_j)$. Then $\sum |B_j - A_j| / \sum |b_j - a_j| \to 1$ in probability. Hence with
the unclassified data it is easy to reconstruct the interval structure. If $\gamma$ is large,
an almost perfect classifier can be constructed: $\hat Y(x) = Y_{i(x)}$, where $i(x)$ is the
first $i \in \{1, \dots, n\}$ such that $X_i$ and $x$ are in the same interval $[A_j, B_j]$, and
$\hat Y(x) = 1/2$ otherwise.

^1 This example was conceived in a discussion with Nicolai Meinshausen and Peter
Bühlmann.

Let $(X_{(1)}, Y_{(1)}), \dots, (X_{(n)}, Y_{(n)})$ be the fully observed sample sorted such that
$X_{(1)} < X_{(2)} < \dots < X_{(n)}$. Since $\sigma/\tau \to \infty$, most of the $(0, 1)$ interval belongs to
the support. Further, $\gamma$ is finite, hence we cannot know whether $X_{(i)}$ and $X_{(i+1)}$
belong to the same interval or not. If $Y_{(i)} \ne Y_{(i+1)}$ we know that $P(Y = 1 \mid x)$
is not constant on the interval $(X_{(i)}, X_{(i+1)})$, but not much more than that.
The change point is almost uniformly distributed (it is if $\gamma$ is very small). If
$Y_{(i)} = Y_{(i+1)}$, we know that it is quite likely that $P(Y = 1 \mid x)$ is not constant on
the interval $(X_{(i)}, X_{(i+1)})$. In any case we cannot know perfectly well $E(Y \mid X)$
for a random $X$.

Clearly the classification error of an estimator based only on the unlabeled
data is 1/2: there is no way to know the values of the $p$'s given only the $X$'s.
At the same time, any classifier based only on the labeled points is very weak
and its classification error is close to the maximal 1/2. However, the
classification error of a classifier based on all the data is close to the minimal 0. The
semi-supervised approach works because the way $P(Y = 1 \mid x)$ is constructed is
strongly tied to the way the support of $X$ is constructed, and the statistician
knows these ties pretty well.
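The two-step construction can be sketched in a simplified form. The geometry below is a deliberate simplification and an assumption of this sketch: $K$ intervals of fixed width separated by fixed gaps replace the exponential lengths of the example, and all names and sample sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
K, width, gap = 20, 0.04, 0.01              # hypothetical fixed geometry
starts = np.arange(K) * (width + gap)       # interval j is (starts[j], starts[j] + width)
p = rng.integers(0, 2, size=K)              # one label per interval

def sample_x(size):
    """Draw points uniformly on the union of the intervals; also return interval indices."""
    idx = rng.integers(0, K, size=size)
    return starts[idx] + width * rng.random(size), idx

x_unlabeled, _ = sample_x(20000)
x_labeled, idx_labeled = sample_x(200)
y_labeled = p[idx_labeled]

# Step 1 (unlabeled data only): split the sorted X's wherever the spacing is large.
xs = np.sort(x_unlabeled)
breaks = np.flatnonzero(np.diff(xs) > gap / 2)
A = np.concatenate(([xs[0]], xs[breaks + 1]))   # smallest X in each recovered interval
B = np.concatenate((xs[breaks], [xs[-1]]))      # largest X in each recovered interval

def cluster_of(x):
    """Index j with A[j] <= x <= B[j], or -1 if x lies in no recovered interval."""
    j = np.searchsorted(A, x, side="right") - 1
    ok = (j >= 0) & (x <= B[np.clip(j, 0, len(B) - 1)])
    return np.where(ok, j, -1)

# Step 2 (labeled data): give each recovered interval the label of a labeled point in it.
cluster_label = np.full(len(A), 0.5)
cl = cluster_of(x_labeled)
cluster_label[cl[cl >= 0]] = y_labeled[cl >= 0]

def classify(x):
    j = cluster_of(x)
    return np.where(j >= 0, cluster_label[np.clip(j, 0, len(A) - 1)], 0.5)

x_test, idx_test = sample_x(2000)
accuracy = np.mean(classify(x_test) == p[idx_test])
```

The classifier succeeds because the simulation ties the label to the interval structure of the support, which is exactly the tie the statistician has to assume and cannot check from the $X$'s alone.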

Can we justify the transduction from the experiment with observations $X$
to the experiment with $(X, Y)$? From the distribution of $X$ to the conditional
distribution of $Y$ given $X$? Certainly the answer is yes when simulations are
done. However, can we give a statistical (empirical) justification for that? The
answer to this difficulty is usually something like "see, it worked these many
times, in this many best typical examples." We suggest taking these answers
with the same grain of salt as answers that dismiss the need for confidence sets,
or P-values (or their Bayesian counterparts), because "rejecting the null was
typically successful".

The transduction argument was successful in the above two simulations
because the data were generated by a mechanism that tied together the value of the
regression function and the underlying covariate distribution. Note, however,
the following pseudo-theorem. Cf. Bickel, Klaassen, Ritov and Wellner (1998)
for precise formulations and examples.

Meta-theorem 5.1 Suppose $(X, Y), (X_1, Y_1), (X_2, Y_2), \dots$ are i.i.d. Suppose
that $X \sim H$, $P(Y = 1 \mid X = x) = p_{\vartheta,\nu}(x)$, $(H, \nu) \in \mathcal{H} \times \mathcal{N}$. Then, the
semiparametric bound for estimating $\vartheta$ is the same whether the distribution $H$
of $X$ is known or unknown.

As we read the theorem, it implies that there is no data-dependent way to use the
covariate distribution in a regular inference about the conditional distribution,
at least locally and at the $\sqrt{n}$ rate. Any use of the covariate distribution is based
on an a-priori assumed connection and cannot be quantified or justified empirically.

5.2. Preprocessing PCA

Preprocessing PCA is another version of the semi-supervised transduction, cf.
Jolliffe (1972), Jolliffe (1973), and Cook and Forzani (2008). Suppose that
$X, X_1, \dots, X_N$ are i.i.d., $X \in \mathbb{R}^p$. This sample is partially labeled with a $Y \in \mathbb{R}$,
such that $(X, Y), (X_1, Y_1), \dots, (X_n, Y_n)$ are i.i.d., and $(X, Y)$ follows a linear
regression model. We consider the case where both $N \gg n$ and $p \gg n$. A
common practice when there are too many variables, which is readily available in
the semi-supervised state of mind, is reducing the number of variables using
PCA (principal component analysis). This can be done using the whole $X$ sample.
After the PCA, we can regress $Y$ on the $s$ dominant principal
components, where $s \ll n$ is chosen either a priori, or in a data-dependent way
(depending on $X_1, \dots, X_N$).

The logic is irresistible. If $X = (X^1, \dots, X^p)'$ and $\mathrm{cor}(X^1, X^2) \approx 1$, we
certainly can retain only the average $(X^1 + X^2)/2$, and ignore their difference,
thus reducing the number of variables by one. Not much information is lost
by this. However, this logic is transductive, and cannot be justified using the
experimental $X$ data. It is again based on some magic connection between the
marginal distribution of $X$ and that of $Y$ given $X$. It appeals to some principle
that says roughly that nothing is accidental. But this appeal to 'justice' is clearly
fallible. It may be that $Y$ is a function of the mismatch between $X^1$ and $X^2$, and
not so much of their conjunction. For example, let $X^1$ be the optical power of
the lenses a patient needs (in diopters), and $X^2$ that of those he actually uses. Luckily
they are highly correlated, but presumably headache is caused by their small
difference. For another example, tension within a couple may be caused more by
education difference than by education average.
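The lenses example can be imitated numerically. The sketch below is an assumed toy model, not data from any study: two nearly collinear covariates whose small difference drives the outcome, with PCA computed from the eigendecomposition of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
common = rng.normal(size=n)                  # shared component of the two covariates
diff = 0.1 * rng.normal(size=n)              # small mismatch between them
X = np.column_stack([common + diff / 2, common - diff / 2])
y = 10.0 * (X[:, 0] - X[:, 1]) + 0.1 * rng.normal(size=n)   # outcome driven by the mismatch

def r_squared(features, target):
    """R^2 of an ordinary least-squares fit with intercept."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1.0 - resid.var() / target.var()

# PCA via the eigendecomposition of the sample covariance matrix.
Xc = X - X.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = Xc @ eigvec[:, -1]                     # eigh returns eigenvalues in ascending order

correlation = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
r2_pc1 = r_squared(pc1, y)                   # regression on the dominant component only
r2_full = r_squared(X, y)                    # regression on both original covariates
```

In this toy model the dominant component carries almost all the variance of $X$ and almost none of the information about $Y$, so the reduction that looks harmless from the $X$ sample discards the signal.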

5.3. Saving the transduction 

We have a real problem if by semi-supervised learning we understand what
Gammerman, Vovk and Vapnik (1998) refer to by: "This is the problem of
transduction in the sense that we are interested in the classification of a
particular example rather than in the general rule for classifying future examples."
However, with a large data set a different point of view can be considered, and
a more gentle interpretation is possible. Suppose we want to use the sample
for prediction. The best predictor is nonparametric, and a-priori belongs to a
very large set of potential predictors, so large as to make useless any empirical
risk minimizer which is based on the labeled data alone. However, one may use the
unlabeled sample essentially for suggesting a small set of potential predictors.
The final predictor is going to be selected from these potential
predictors, and this selection is going to be done solely based on the labeled
sample. This predictor can be compared to the predictor which is based solely
on the labeled data, and the best of them can be used. In this way, if the
unverifiable assumptions used for the semi-supervised reasoning are
approximately valid, then they will be utilized, and if they yield a bad predictor, they
will be discarded.

Let us repeat. The unlabeled data are used only in suggesting potential
predictors, and not in the decision on the final predictor.
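A minimal sketch of this division of labor follows. It assumes a toy model (two nearly collinear covariates, outcome driven by their difference), a PCA-based candidate suggested by the unlabeled sample, and a split of the labeled sample into a fitting part and a selection part; the names and sample sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)

def draw(size):
    """Hypothetical model: nearly collinear covariates, outcome driven by their difference."""
    common = rng.normal(size=size)
    diff = 0.1 * rng.normal(size=size)
    X = np.column_stack([common + diff / 2, common - diff / 2])
    y = 10.0 * diff + 0.1 * rng.normal(size=size)
    return X, y

X_unlabeled, _ = draw(5000)          # labels discarded: only X is used
X_train, y_train = draw(100)         # labeled sample, part 1: fitting
X_valid, y_valid = draw(100)         # labeled sample, part 2: selection

# The unlabeled data suggest a candidate: project on the leading principal direction.
center = X_unlabeled.mean(axis=0)
_, eigvec = np.linalg.eigh(np.cov(X_unlabeled - center, rowvar=False))
direction = eigvec[:, -1]

feature_maps = {
    "pca": lambda X: (X - center) @ direction[:, None],   # suggested by unlabeled X
    "full": lambda X: X,                                   # uses the labeled sample only
}

def fit(features, target):
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def risk(coef, features, target):
    design = np.column_stack([np.ones(len(target)), features])
    return float(np.mean((target - design @ coef) ** 2))

# The labeled data alone decide which candidate is kept.
valid_risk = {}
for name, fmap in feature_maps.items():
    coef = fit(fmap(X_train), y_train)
    valid_risk[name] = risk(coef, fmap(X_valid), y_valid)
chosen = min(valid_risk, key=valid_risk.get)
```

Here the candidate suggested by the unlabeled data happens to be a poor one, and the labeled sample discards it; in a model where the dominant component did carry the signal, the same selection step would keep it.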

A similar argument can be used in the preprocessing PCA, and the modified
method is easily defensible. Co-linearity is not really a problem for prediction.
With LASSO-like techniques (Tibshirani, 1996), $p \gg n$ can be handled, as
long as there is an approximation with only $p_0 \ll n$ non-zero coefficients, cf.
Greenshtein and Ritov (2004). PCA preprocessing can be used to generate a new
set of $q$, $q \ll p$, variables $Z^1, \dots, Z^q$. One can regress $Y$ on $Z^1, \dots, Z^q$, but for
this one needs extra assumptions. One can regress $Y$ on $Z^1, \dots, Z^q, X^1, \dots, X^p$
with very few extra assumptions. Then if the PCA step happened to be smart,
it will be effective, and if information was lost in the PCA reduction, it will be
regained.

6. Counterfactual causality 

The counterfactual theory of causality (cf. Rubin (1974), Holland (1988) and
others) assumes the following.

• Each individual is characterized by two outcomes $(Y^C, Y^T)$: one under
the control condition and one under the treatment condition.

• The "causal effect" is the difference between these two potential outcomes,
i.e., $\delta = Y^T - Y^C$.

• However, as mentioned, only one of these potential outcomes is observed.
The observation on subject $i$ is $(Y_i, T_i, X_i)$, where
$Y_i = Y_i^C + T_i (Y_i^T - Y_i^C)$, $T_i \in \{0, 1\}$.

For example, each participant carries two outcomes (from birth?): the first
would be expressed if he smokes all his life, and the other if he does not.
But the same subject may participate in another experiment, and therefore he
has another couple of outcomes, where the first outcome will be measured if he
learns German in one type of program, and the other will be expressed if
he learns it in another.

The model can be summarized by $Y_i = Y_i^C + T_i \delta_i$, $T_i \in \{0, 1\}$. When we are
dealing with a well designed experiment with a random allocation of units, $T$
is independent of $(Y^C, Y^T)$, and the mean causal effect is easily estimated.

However, this metaphysics is used exactly when $T$ is not exogenous and in
particular it is not independent of $(Y^C, Y^T)$. The different solutions of this basic
"difficulty" are based on the assumption that $T$ is independent of $(Y^C, Y^T)$
conditional on a linear function of $X$.

Using heavily loaded metaphysics in a naturally positivistic science such as
statistics is justified when either

1. It justifies in one sharp Ockhamian cut many, many problems.

2. It really simplifies the analysis.

3. It unifies the terminology.

None of these conditions is satisfied here. First, is the ax sharp? Can the
counterfactual theory of causality contribute anything a simple model cannot?
What would be the case if the treatment is continuous? In reality, most
"treatments" are continuous even if measured as either-or. People are not either
passive smokers or passive non-smokers, either study for the exam or come unprepared,
either take the drug or not. In a medical experiment, even if the control condition
is objectively defined, the experimental condition is typically arbitrarily chosen
from a continuous set of doses and treatment durations. Should we use a continuum
of counterfactuals? What happens if the "treatment" is multivariate? (Passive
smoker in the work place, only once an hour, and once a week in the pub. . . ) A
function of time? Simple it ain't.

Does the counterfactual point of view simplify the terminology? The natural
terminology of the standard model $(\mathcal{Z}, \mathcal{F}, \{P_\vartheta\})$ is that of conditional
distribution. Hence we ask, does the model $Y = Y^C + T(Y^T - Y^C)$ have any added value
over: The conditional distribution of $Y$ given $T = 0$ has a different mean than
its conditional distribution given $T = 1$? We cannot dispense with the conditional
terminology altogether because we need to talk about the conditional distribution
of $(T, Y^C, Y^T)$ given the covariate $X$. Hence, the counterfactual presentation
does not simplify the causality lexicon. It just adds new terms.

Some may say that it adds. It tells us about two different statistical models:
$(\mathcal{Y}, \mathcal{F}, \{Q^C_\vartheta\})$ and $(\mathcal{Y}, \mathcal{F}, \{Q^T_\vartheta\})$. These models are useful if we want to consider
the distribution of $Y$ if the unit is "enforced" to be in one of the two groups.
After all this is what we really want. We want to know what impact a
new no-smoking rule will have on life expectancy.

However, if we consider different statistical models, we could talk about a
plenitude of them, not only about two. We can consider $(\mathcal{Y}, \mathcal{F}, \{Q^t_\vartheta\})$, $t \in \operatorname{supp} T$,
and certainly it may be that $Q^t_\vartheta(Y \in A) = P_\vartheta(Y \in A \mid T = t)$, and for this we do
not need the counterfactual terminology.

This last presentation is a transduction over the standard statistical model,
which talks only about $P_\vartheta$. We don't have manipulated data in the sample,
and hence any conclusion on manipulation is beyond statistical reasoning.
Typically it is just wishful thinking. See for example Bound, Jaeger and Baker
(1995), for a causality argument based on self-evident assumptions that happened
not to be true. Wishes sometimes come true, but more often than not, they
do not. Thus Pearl (2009) claims that "confounding bias cannot be detected
or corrected by statistical methods alone". More specifically, "This information
must be provided by causal assumptions which identify relationships that remain
invariant when external conditions change." But, more often than not, this is
petitio principii, or begging the question. One starts with causal assumptions
which are the empirical conclusion in disguise.

7. Conclusions

Transductive inference is unavoidable. One cannot avoid causality questions
in the name of statistical integrity. One should realize that decisions must
be made, but one should also recognize that they are not based on safe statistical
ground. Certainly, quoting P-values and confidence intervals may be done only
for the sake of style.

However, when we discussed the problems of prediction and semi-supervised
learning we argued that, when done carefully, almost transductive arguments can
be used. In fact, one can use procedures that work nicely if his assumptions are
valid, and do not fail him if they are not. If the time series is indeed stationary,
then his predictors will be meaningful. If it is not, they would be trivial but
right. The unlabeled data would be used to construct a better classifier if the
right smoothness exists, otherwise it would not mislead the careful statistician.